</br>
<div id="intro", style="font-family:cursive">
This data set includes information about individual rides made in the Bay Wheels bike-sharing system, which covers several cities in the greater San Francisco Bay Area. The raw trip data needs some wrangling to make it tidy for analysis. To cover a full year, the analysis uses the 2019 trip data from January to December.
Bay Wheels (previously known as Ford GoBike) is a regional public bicycle sharing system in the San Francisco Bay Area, California. Bay Wheels is the first regional and large-scale bicycle sharing system deployed in California. The dataset used for this exploratory analysis consists of the monthly trip files from January 2019 to December 2019.
</div></br>
# Import the Python libraries needed for data wrangling and for plotting the visualizations
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import os
from zipfile import ZipFile
import glob
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# Unzip the zip files downloaded from the Ford GoBike site (system data) for this analysis.
# Each archive in the 'data' folder is extracted and then removed from that folder.
# This cell unzips the files programmatically; it does not need to be run again.
# folder = 'data'
# for item in os.listdir(folder):
#     with ZipFile(folder + '/' + item) as zfile:
#         zfile.extractall(folder)
#     os.remove(folder + '/' + item)
# Check the csv files extracted from the zip files. They are stored in the 'data' folder.
folder='data'
os.listdir(folder)
# Create dataframes from those csv files and concat them into a single dataframe
frames = [pd.read_csv(f) for f in glob.glob(os.path.join(folder, '*.csv'))]
trip_df = pd.concat(frames, ignore_index=True, sort=True);
trip_df.sample(10);
trip_df.info();
# Save the data of the dataframe as a csv file
trip_df.to_csv('trip_19.csv')
# Create a copy of the dataframe for exploration and explanation.
df = trip_df.copy()
df.head()
# Get a summary of the dataframe
# (show_counts replaces the null_counts argument, which newer pandas versions no longer accept)
df.info(show_counts=True)
# Check the shape of dataframe
df.shape
# Check the values of bike_share_for_all_trip column.
df.bike_share_for_all_trip.unique()
# Check the values of rental_access_method column.
df.rental_access_method.unique()
<div id="intro", style="font-size:18px; font-family:cursive">
- From the above summary we can see that several columns need a different datatype. We also have to create some extra columns for the sake of easy exploration.
</div>
# Change type of bike_id, start_station_id, end_station_id from int to string.
df.bike_id = df.bike_id.astype('str')
df.start_station_id = df.start_station_id.astype('str')
df.end_station_id = df.end_station_id.astype('str')
# Change type of start_time and end_time from object to datetime.
df.start_time = pd.to_datetime(df.start_time)
df.end_time = pd.to_datetime(df.end_time)
# Change type of rental_access_method, user_type and bike_share_for_all_trip columns to category.
df.rental_access_method = df.rental_access_method.astype('category')
df.user_type = df.user_type.astype('category')
df.bike_share_for_all_trip = df.bike_share_for_all_trip.astype('category')
# Check the types of the dataframe columns.
df.info()
# Get sample of the dataframe.
df.sample(10)
<div id="intro", style="font-size:18px; font-family:cursive">
- Now check the column values and change or modify them where necessary. Also create new columns for ease of exploration.
</div>
# Check how many trips last at least one hour (3600 seconds).
df.query('duration_sec >= 3600').count()
<div id="intro", style="font-size:18px; font-family:cursive">
- We can see that comparatively few trips last more than 1 hour. So it is better to express ride durations in minutes instead of hours.
</div>
# Create a new column duration_minute with trip time in minute.
df['duration_minute'] = df['duration_sec'] / 60
df.duration_minute = df.duration_minute.round(3)
# Get the top 10 durations in minutes.
df.duration_minute.nlargest(10)
<div id="intro", style="font-size:18px; font-family:cursive">
- We can see that there is an outlier in the duration_minute column, 15201.833 minutes, which is far larger than the other values. So we have to remove the row that contains this value.
</div>
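# Locate the row that holds the largest duration (index 2481276 in the original run).
outlier_index = df.duration_minute.idxmax()
outlier_index
# Drop that row by its looked-up index instead of a hard-coded number.
df.drop(outlier_index, inplace=True)
# Check whether that row has been removed.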
df.duration_minute.nlargest(3)
# Get start day, month, hour, weekday from the start_time column.
df['start_date'] = df['start_time'].dt.date
df['start_month'] = df['start_time'].dt.strftime('%B')
df['start_month_day'] = df['start_time'].dt.day.astype(int)
df['start_weekday'] = df['start_time'].dt.strftime('%A')
df['start_hour'] = df['start_time'].dt.strftime('%H').astype(int)
df['start_month_num'] = df['start_time'].dt.month.astype(int)
# Check the column status.
df.info()
# Check some rows of the dataframe
df.head()
<div id="intro", style="font-size:18px; font-family:cursive">
- Change datatype of newly created columns.
</div>
# Change datatype of start_month and start_weekday to category.
df.start_month = df.start_month.astype('category')
df.start_weekday = df.start_weekday.astype('category')
# Create a function to get the season from the month number of the start time.
# Then create a new column `season`.
def season_of_year(month):
    # month is the calendar month number (1-12)
    if month in range(3, 6):
        return 'Spring'
    if month in range(6, 9):
        return 'Summer'
    if month in range(9, 12):
        return 'Autumn'
    else:
        return 'Winter'
# Apply function to start_month_num column creating a new season column
df['season'] = df.start_month_num.map(season_of_year)
# Check the values of `season` column.
df.season.value_counts()
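The month-to-season rule above can also be written as a lookup table. The following is a minimal sketch on a toy Series, not the notebook's own method; the name `MONTH_TO_SEASON` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical lookup table: calendar month number -> season,
# using the same grouping as season_of_year above.
MONTH_TO_SEASON = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Autumn', 10: 'Autumn', 11: 'Autumn',
}

# Toy stand-in for df.start_month_num (made-up values).
months = pd.Series([1, 4, 7, 10, 12], name='start_month_num')

# Series.map accepts a dict and looks up every value in it.
seasons = months.map(MONTH_TO_SEASON)
print(seasons.tolist())
```

A dict keeps the whole rule visible in one place, and any month number missing from the table would show up as NaN instead of silently falling into 'Winter'.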
<div id="intro", style="font-size:18px; font-family:cursive">
- Drop the columns that are not useful -
bike_share_for_all_trip, duration_sec, end_station_id, start_station_id, end_station_latitude, end_station_longitude, start_station_latitude, start_station_longitude.
</div>
# Drop the columns that are not useful for this analysis.
df.drop(['duration_sec', 'bike_share_for_all_trip', 'start_station_id',
'start_station_latitude', 'start_station_longitude', 'end_station_id',
'end_station_latitude', 'end_station_longitude'], axis=1, inplace=True)
# Check all the existing columns.
df.info()
# Store the clean and ready for exploration data in a csv file.
df.to_csv('trip_clean_19.csv')
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
The cleaned dataset contains the columns of interest for the analysis. It consists of around 2.5M rows and 15 columns. We store it in the
trip_clean_19.csv file for future analysis. The dataset contains the following features -
- bike_id
- end_station_name
- end_time
- rental_access_method
- start_station_name
- start_time
- user_type
- duration_minute
- start_month
- start_month_day
- start_weekday
- start_hour
- start_month_num
- season
- start_date
This dataset contains all the useful information that was either present in the original data or derived from it. One can explore all interesting features using this dataset.
</div>
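A CSV file does not store column types, so the conversions made above have to be reapplied when the cleaned file is read back. The sketch below illustrates this on a tiny in-memory stand-in for trip_clean_19.csv; the sample rows and the name `reloaded` are assumptions, and only a few of the 15 columns are shown.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for trip_clean_19.csv (values are made up).
csv_text = (
    "bike_id,start_time,end_time,user_type,duration_minute\n"
    "1234,2019-01-01 08:00:00,2019-01-01 08:10:30,Subscriber,10.5\n"
    "987,2019-01-02 17:05:00,2019-01-02 17:30:00,Customer,25.0\n"
)

# Restore the types chosen during wrangling while reading the file.
reloaded = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'bike_id': 'str', 'user_type': 'category'},
    parse_dates=['start_time', 'end_time'],
)
print(reloaded.dtypes)
```

With the real file, the `io.StringIO(csv_text)` argument would be replaced by the file name.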
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
In this section, I investigate the trend of some variables in the data. Univariate exploration is necessary to determine the trend of the individual variables of the dataset you are working with. This initial study not only helps visualize trends, but allows determining outliers and points outside the norm.
</div>
# Here we set the values to the variables which will be used in the following analysis.
# Set base color for univariate exploration.
base_color = sb.color_palette()[0]
# Set the orders for categorical variables that will be used later.
month_order=['January','February','March','April','May','June','July','August','September','October','November','December']
season_order = ['Summer', 'Autumn', 'Winter', 'Spring']
day_order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Set ticks and tick labels for start_hour (12-hour clock).
hour_ticks = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
hour_ticks_labels = ["12am", "1am", "2am", "3am", "4am", "5am", "6am", "7am", "8am", "9am", "10am", "11am", "12pm",
"1pm", "2pm", "3pm", "4pm", "5pm", "6pm", "7pm", "8pm", "9pm", "10pm", "11pm"]
# Set the ticks and tick labels for season.
season_ticks = [0,1,2,3]
season_ticks_label = ['Summer \n (Jun,Jul,Aug)', 'Autumn \n (Sep,Oct,Nov)',
'Winter \n (Dec,Jan,Feb)', 'Spring \n (Mar,Apr,May)']
<div id="intro", style="font-size:18px; font-family:cursive">
</div>
# See monthly statistics of trip demands.
# Set the size and background of the plot before drawing it, so the settings apply to this figure.
sb.set(rc={'figure.figsize':(10,5)})
sb.set_style('whitegrid')
sb.countplot(data=df, x='start_month', order=month_order, color=base_color);
# Rotate the ticks by 90 degrees for better visibility.
plt.xticks(rotation=90)
plt.xlabel('Month', fontsize=16);
plt.ylabel('Number of Trips', fontsize=16);
plt.title('Monthly Rides Statistics\n', y=1.05, fontsize=18, fontweight='bold');
plt.suptitle('\nYear 2019\n', fontsize=14);
<div id="intro", style="font-size:18px; font-family:cursive">
- We can see that demand is relatively low in December and high in July.
- Generally demand moves up and down throughout the year. </div>
<div id="intro", style="font-size:18px; font-family:cursive">
</div>
df.season.value_counts()
# See the ride statistics per season. Does the season affect riders' interest in riding?
sb.countplot(data=df, x='season', order=season_order, color=base_color);
# Set the figure size and label the plot
sb.set(rc={'figure.figsize':(10,5)});
sb.set_style('whitegrid');
plt.title('Rides Statistics Over the Seasons\n', fontsize=18);
plt.suptitle('\n\nYear 2019\n', fontsize=13);
plt.xlabel('Seasons', fontsize=16);
plt.ylabel('Number of Rides/Year', fontsize=16);
plt.xticks(season_ticks, season_ticks_label);
<div id="intro", style="font-size:18px; font-family:cursive">
- From the above plot we can see that 'Spring' sees the most rides, which suggests it is the most comfortable season for riders.
- In 'Winter' riders are least inclined to ride. </div>
<div id="intro", style="font-size:18px; font-family:cursive">
Here we see how trips vary through the week, and what proportion of rides each day contributes. </div>
# Now see the weekly pattern of trips. Is the service less in demand during weekends?
sb.countplot(data=df, x='start_weekday', color=base_color, order=day_order);
plt.xlabel('Trip On Days of Week', fontsize=16);
plt.xticks(rotation=90);
plt.ylabel('Number of Trips', fontsize=16);
plt.title('Trip Statistics Through The Week', y=1.05, fontsize=18, fontweight='bold');
<div id="intro", style="font-size:18px; font-family:cursive">
- We can see that during the weekends demand decreases by almost 50%, and Bay Wheels sees the least demand on Sunday. </div>
# Make a plot with relative frequency for weekly rides statistics.
# Get relative frequency
n_points = df.shape[0]
rel_fre = df.start_weekday.value_counts() / n_points
# Make the plot.
rel_fre.plot(kind='bar', fontsize=13, alpha=0.9);
# Label the plot.
plt.title('Proportional Bike Rides Over Weekdays\n', y=1.05, fontsize=18, fontweight='bold');
plt.suptitle('\n\nYear 2019\n', fontsize=13);
plt.xlabel('Weekdays\n', fontsize=14);
plt.ylabel('Relative Rides Counts\n', fontsize=14);
<div id="intro", style="font-size:18px; font-family:cursive">
From the above two plots we can observe -
- During the weekends demand decreases by almost 50%. Bay Wheels sees the least demand on Sunday and the highest demand on Tuesday.
- Around 80% of bike rides happen on working days (Monday to Friday) and only 20% on weekends (Saturday and Sunday). </div>
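The split between working days and weekends can also be checked numerically. The following is a small sketch on made-up start times, not the notebook's data; the names `trips` and `is_weekend` are assumptions for illustration.

```python
import pandas as pd

# Made-up start times standing in for df.start_time.
start_time = pd.to_datetime([
    '2019-07-01 08:00',  # Monday
    '2019-07-02 08:00',  # Tuesday
    '2019-07-03 17:00',  # Wednesday
    '2019-07-05 09:00',  # Friday
    '2019-07-06 11:00',  # Saturday
])
trips = pd.DataFrame({'start_time': start_time})

# dayofweek is Monday=0 ... Sunday=6, so values >= 5 are weekend days.
trips['is_weekend'] = trips['start_time'].dt.dayofweek >= 5

# The mean of a boolean column is the fraction of True values.
weekend_share = trips['is_weekend'].mean()
weekday_share = 1 - weekend_share
print(weekday_share, weekend_share)
```

Applied to the full dataframe, the same two lines would give the working-day and weekend proportions read off the bar chart.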
<div id="intro", style="font-size:18px; font-family:cursive">
</div>
# See hourly statistics of rides.
# Set the figure size and style before drawing so that they apply to this figure.
sb.set(rc={'figure.figsize':(15,7)})
sb.set_style('whitegrid')
sb.countplot(data=df, x='start_hour', color=base_color);
plt.xlabel('Hours of A Day', fontsize=16);
plt.ylabel('Trips Count', fontsize=16);
plt.title('Hourly Statistics of Trips of A Day', y=1.05, fontsize=18, fontweight='bold');
# Set the ticks on x-axis.
plt.xticks(hour_ticks,hour_ticks_labels);
<div id="intro", style="font-size:18px; font-family:cursive">
- We can see that 8 A.M. is the rush hour in the first half of the day and 5 P.M. is the rush hour in the second half. </div>
# Create a new dataframe for pointplot.
hour_df = df.groupby('start_hour').agg({'bike_id':'count'}).reset_index()
hour_df['bike_id'] = (hour_df['bike_id']/hour_df['bike_id'].sum())*100
# Plot a pointplot for hourly trip visualization.
plt.figure(figsize=(15,8))
sb.pointplot(data=hour_df, x='start_hour', y='bike_id', scale=.7, color='green');
plt.title('Distribution of Rides through out the day.', fontsize=18, y=1.01);
plt.xlabel('Hours', fontsize=16);
plt.ylabel('Rides(%)', fontsize=16);
plt.xticks(hour_ticks,hour_ticks_labels);
<div id="intro", style="font-size:18px; font-family:cursive">
From the above two plots we can observe that -
- There are two rush hours, 8 A.M. and 5 P.M.
- From midnight to 4 A.M. we see very few rides. Most rides, around 85%, happen from 7 A.M. to 7 P.M. </div>
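The share of rides that fall inside a window of hours can be computed directly from the start hour. This is a minimal sketch on made-up hours; the window bounds 7 and 19 mirror the observation above, and the variable names are assumptions.

```python
import pandas as pd

# Made-up start hours standing in for df.start_hour.
start_hour = pd.Series([3, 7, 8, 8, 12, 17, 17, 19, 22, 23], name='start_hour')

# between() is inclusive on both ends by default: 7 <= hour <= 19.
in_window = start_hour.between(7, 19)

# Fraction of rides inside the window, expressed as a percentage.
share_pct = in_window.mean() * 100
print(round(share_pct, 1))
```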
<div id="intro", style="font-size:18px; font-family:cursive">
</div>
# Get the daily ride trend.
# start_date holds plain dates, so no time format string is needed.
df.start_date = pd.to_datetime(df.start_date)
df.groupby('start_date').agg({'bike_id':'count'}).plot(style='-', legend=False, figsize=(15,7), color='brown');
plt.title('Daily Ride Trends\n', fontsize=18);
plt.suptitle('\n\nYear 2019\n', fontsize=14);
plt.xlabel('\nMonth (2019)', fontsize=16);
plt.ylabel('Number of Rides/Day', fontsize=16);
<div id="intro", style="font-size:18px; font-family:cursive">
- From the above plot we can say that the daily count fluctuates strongly and varies from month to month. Every month shows at least four dips, which we attribute to weekends.
- The trend differs from month to month, and we can see that rides were popular in April and July. </div>
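One way to look past the weekend dips is a 7-day rolling mean, which always averages five working days and two weekend days. The sketch below uses synthetic daily counts (100 rides on working days, 50 on weekends); the numbers and names are assumptions, not values from the dataset.

```python
import pandas as pd

# Synthetic daily counts for four weeks, starting on Monday 2019-07-01.
days = pd.date_range('2019-07-01', periods=28, freq='D')
daily = pd.Series([50 if d.dayofweek >= 5 else 100 for d in days], index=days)

# Every 7-day window covers five working days and two weekend days,
# so the weekly dips cancel out in the rolling mean.
smoothed = daily.rolling(window=7).mean()
print(smoothed.dropna().round(2).unique())
```

On the real daily counts, the smoothed line would keep the month-to-month movement while removing most of the weekly pattern.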
<div id="intro", style="font-size:18px; font-family:cursive">
</div>
# Draw a pie chart for user_type visualization.
colors = ['#BDFCC9','#40E0D0','#00C5CD', '#B0E0E6', '#AEEEEE']
user_type = df.user_type.value_counts()
explode = (0, 0.1)
plt.pie(user_type, explode=explode, labels=user_type.index, colors=colors, shadow=True, autopct='%1.1f%%', startangle=90);
plt.axis('equal'); # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Rides proportion per User type', fontsize=20);
plt.suptitle('\n\n\nYear 2019 \n',fontsize=14);
<div id="intro", style="font-size:18px; font-family:cursive">
From the above plot we can see that -
- Most rides are made by subscribers, around 80.6%.
- The customer segment accounts for only 19.4% of the rides. </div>
<div id="intro", style="font-size:18px; font-family:cursive">
</div>
df.duration_minute.max()
# Plot a histogram for duration_minute column.
plt.hist(data=df, x='duration_minute');
plt.xlabel('Trip Duration in Minute', fontsize=16);
plt.ylabel('Number of Trips', fontsize=16);
plt.title('Trip Duration VS Trip Counts', y=1.05, fontsize=18, fontweight='bold');
<div id="intro", style="font-size:18px; font-family:cursive">
- We can see that most of the trips are shorter than 150 minutes. </div>
# Check the range of duration_minute values.
df.duration_minute.describe(percentiles=[.99])
<div id="intro", style="font-size:18px; font-family:cursive">
- We can see that 99% of the trip durations are less than 70 minutes. So we plot durations up to 90 minutes. </div>
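The histogram cutoff can also be derived from the 99th percentile instead of being read off by eye. This is a sketch on synthetic right-skewed durations; the exponential distribution, the rounding to a multiple of 10 and all names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed durations in minutes (not the real data).
rng = np.random.default_rng(0)
duration_minute = pd.Series(rng.exponential(scale=12, size=10_000))

# 99th percentile of the durations.
p99 = duration_minute.quantile(0.99)

# Round the cutoff up to the next multiple of 10 for a tidy axis limit.
upper = int(np.ceil(p99 / 10) * 10)

# One-minute bins from 0 up to and including the cutoff.
bins = np.arange(0, upper + 1, 1)

# Fraction of trips that the truncated histogram still shows.
covered = (duration_minute <= upper).mean()
print(upper, round(covered, 4))
```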
# Plot the histogram.
bins = np.arange(0, 90, 1)
ticks = np.arange(0, 90, 5)
plt.hist(data=df, x='duration_minute', bins=bins);
plt.xticks(ticks);
plt.xlabel('Trip Duration in Minute', fontsize=16);
plt.ylabel('Counts', fontsize=16);
plt.title('Trip Durations Vs Count', y=1.05, fontsize=18, fontweight='bold');
<div id="intro", style="font-size:18px; font-family:cursive">
From the above plots we observe -
- The maximum trip duration is 1437.167 minutes.
- Ride durations of 6 and 7 minutes are the most frequent. </div>
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
</div>
print('There are {} starting stations.'.format(len(df.start_station_name.unique())))
print('There are {} ending stations.'.format(len(df.end_station_name.unique())))
# See the statistics of start_station_name: how many rides start at each Bay Wheels station.
df.start_station_name.value_counts().sort_values().plot(kind='barh', fontsize=13, figsize=(10,100));
plt.xlabel('Number of rides/Year', fontsize=16);
plt.ylabel('Start Station Name', fontsize=16).set_visible(False);
plt.title('Available start station VS number of rides/Year\n', fontsize=18);
<div id="intro", style="font-size:18px; font-family:cursive">
- The above plot shows the 447 starting stations and the number of rides per year from each of them. </div>
</br>
# See the statistics of end_station_name: how many rides end at each Bay Wheels station.
df.end_station_name.value_counts().sort_values().plot(kind='barh', fontsize=13, figsize=(10,100));
plt.xlabel('Number of rides/Year', fontsize=16);
plt.ylabel('End Station Name', fontsize=16).set_visible(False);
plt.title('Available end station VS number of rides/Year\n', fontsize=18);
<div id="intro", style="font-size:18px; font-family:cursive">
- The above plot shows the 447 ending stations and the number of rides per year to each of them. </div>
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
We observe that -
- There are 447 starting stations and 447 ending stations available.
- Some stations have around 50K rides per year, while others see only 5 rides in a year.
- There are many starting and ending stations. So for policy and business improvement it is better to look into the most and least popular stations. </div>
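Picking out the most and least popular stations comes down to counting rides per station and taking both ends of the ranking. Below is a minimal sketch with made-up station names 'A' to 'D'; it only illustrates the idea on toy data.

```python
import pandas as pd

# Made-up station names standing in for df.start_station_name.
stations = pd.Series(
    ['A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D', 'D'],
    name='start_station_name',
)

# Number of rides per station.
counts = stations.value_counts()

# The two busiest and the two quietest stations.
top2 = counts.nlargest(2)
bottom2 = counts.nsmallest(2)
print(top2.index.tolist(), bottom2.index.tolist())
```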
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
</div>
# Get the top 30 most popular start_station_name.
df.start_station_name.value_counts().head(30).sort_values().plot(kind='barh', fontsize=13, figsize=(10,10), color='g');
plt.xlabel('Number of rides/Year', fontsize=16);
plt.ylabel('Start Station Name', fontsize=16);
plt.title('Top 30 Most Popular Start Station in 2019\n', fontsize=18);
<div id="intro", style="font-size:18px; font-family:cursive">
- We see that 'Market St at 10th St' is the most popular starting station, with around 46000 rides in the year 2019. The top 3 most used starting stations are -
- Market St at 10th St
- Berry St at 4th St
- San Francisco Caltrain (Townsend St)
</div>
# Get the 20 least popular start_station_name.
df.start_station_name.value_counts().sort_values().head(20).plot(kind='barh', fontsize=13, figsize=(10,10), color='pink');
plt.xlabel('Number of rides/Year', fontsize=16);
plt.ylabel('Start Station Name', fontsize=16);
plt.title('20 Least Popular Start Stations in 2019\n', fontsize=18);
<div id="intro", style="font-size:18px; font-family:cursive">
- From the above plot we can see that there are 5 starting stations with fewer than 5 rides in the year 2019. The 5 least used starting stations are -
- SF Test Station
- Philly Demo
- San Jose Depot
- Mercado Way at Sierra Rd
- Prototype Lab </div>
# Get the top 30 most popular end_station_name.
df.end_station_name.value_counts().head(30).sort_values().plot(kind='barh', fontsize=13, figsize=(10,10), color='g');
plt.xlabel('Number of rides/Year', fontsize=16);
plt.ylabel('End Station Name', fontsize=16);
plt.title('Top 30 Most Popular End Station in 2019\n', fontsize=18);
<div id="intro", style="font-size:18px; font-family:cursive">
- From the above plot we can see that 'San Francisco Caltrain Station 2' is the most popular ending station in 2019. The top 3 most popular destination stations are -
- San Francisco Caltrain Station 2
- San Francisco Caltrain
- San Francisco Ferry Building
</div>
# Get the 20 least popular end_station_name.
df.end_station_name.value_counts().sort_values().head(20).plot(kind='barh', fontsize=13, figsize=(10,10), color='pink');
plt.xlabel('Number of rides/Year', fontsize=16);
plt.ylabel('End Station Name', fontsize=16);
plt.title('20 Least Popular End Stations in 2019\n', fontsize=18);
<div id="intro", style="font-size:18px; font-family:cursive">
- From the above plot we can see that there are 6 ending stations with fewer than 5 rides in a year. The five least popular ending stations are -
- Emeryville Depot
- Philly Demo
- San Jose Depot
- Mercado Way at Sierra Rd
- Prototype Lab </div>
df.info(show_counts=True)
</br>
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
Above we explored the variables of the Bay Wheels trip dataset for the year 2019. After the univariate exploration we can observe the following -
There are 5 stations that riders chose as starting or ending station fewer than 5 times in the year; they appear in the least popular station lists above.
So Bay Wheels can consider excluding these stations from its business plan.
</div>
</br>
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
In this section we explore the relationships between pairs of variables in the dataset. </div>
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
# Create a bar plot to see how ride duration varies with weekdays.
sb.barplot(data=df, x='start_weekday', y='duration_minute', color=base_color, order=day_order);
sb.set(style='whitegrid', rc={'figure.figsize':(10,5)});
plt.xlabel('Day of Week', fontsize=16);
plt.ylabel('Avg. Trip Duration in Minute', fontsize=16);
plt.title('Average Ride Duration Per Day ', y=1.05, fontsize=18, fontweight='bold');
<div id="intro", style="font-size:18px; font-family:cursive">
From the above plot we observe that -
- Though people are less likely to ride on weekends (Saturday, Sunday), their weekend rides tend to last longer. The reason may be that they have more free time on weekends.
<div id="intro", style="font-size:18px; font-family:cursive">
# See how the trip duration varies with month.
sb.barplot(data=df, x='start_month', y='duration_minute', color=base_color, order=month_order);
plt.xticks(rotation=90);
plt.xlabel('Month', fontsize=16);
plt.ylabel('Avg. Trip Duration in Minute', fontsize=16);
plt.title('Average Ride Duration Per Month ', y=1.05, fontsize=18, fontweight='bold');
<div id="intro", style="font-size:18px; font-family:cursive">
From the above plot we can see that -
- Trip duration is not much affected by the month.
- Average trip duration is much the same throughout the year, though February shows a shorter average. </div>
<div id="intro", style="font-size:18px; font-family:cursive">
# See how the trip duration varies with season.
plt.figure(figsize=(8,5));
sb.barplot(data=df, x='season', y='duration_minute', color=base_color, order=season_order);
plt.xticks(season_ticks, season_ticks_label);
plt.xlabel('Season', fontsize=16);
plt.ylabel('Avg. Trip Duration in Minute', fontsize=16);
plt.title('Average Ride Duration Per Season ', y=1.05, fontsize=18, fontweight='bold');
<div id="intro", style="font-size:18px; font-family:cursive">
From the above plot we can see that -
- The Winter season appears less comfortable for riders.
- In Summer people ride for comparatively longer times. </div>
<div id="intro", style="font-size:18px; font-family:cursive">
# Create a new dataframe for the pointplot: number of rides per hour and season.
season_df = df.groupby(['start_hour', 'season']).size().reset_index(name='count')
season_df
# Plot a pointplot for hourly trip visualization.
plt.figure(figsize=(15,8))
sb.pointplot(data=season_df, x='start_hour', y='count', hue='season', scale=.7, palette=['#410967', '#932567', '#DC5039', '#FBA40A']);
plt.title('Distribution of Rides Throughout the Day by Season', fontsize=18, y=1.01);
plt.xlabel('Hours', fontsize=16);
# The y values are ride counts, not percentages.
plt.ylabel('Number of Rides', fontsize=16);
plt.xticks(hour_ticks,hour_ticks_labels);
<div id="intro", style="font-size:18px; font-family:cursive">
From the above plot we can see that -
- Seasonal changes affect the number of rides over the day, but they do not shift the peak hours. </div>
<div id="intro", style="font-size:18px; font-family:cursive">
# See the trend of trips by user_type: how trips vary with user_type from month to month.
# Set the style before drawing so that it applies to this figure.
sb.set(rc={'figure.figsize':(15,7)}, style='whitegrid');
ax = sb.countplot(data=df, x='start_month', hue='user_type', order=month_order, palette=['#8C4843','#4274AD']);
# Label the plot
plt.title('Monthly Trend of Rides by User Type\n', fontsize=18);
plt.suptitle('\n\nYear 2019\n',fontsize=13);
plt.xlabel('Month', fontsize=16);
plt.ylabel('Number of Rides / Month \n',fontsize=16);
ax.legend(loc = 1, framealpha = 1);
<div id="intro", style="font-size:18px; font-family:cursive">
- We can see that there is no fixed trend in rides by subscribers over the year.
- Generally there is an upward trend in rides by customers from May to December. </div>
# Create a new dataframe for the pointplot
user_df = df.groupby(['start_month', 'user_type']).size().reset_index(name='count')
# Make a pointplot for the visualization of rides by user_type.
plt.figure(figsize=(15,8))
palette = {'Subscriber':'purple', 'Customer':'blue'}
# Pass order=month_order so the months appear in calendar order rather than alphabetically.
ax = sb.pointplot(data=user_df, y='count', x='start_month', hue='user_type', palette=palette, scale=.7, order=month_order);
plt.title('Monthly Trend of Rides by User Type\n', fontsize=18, y=1.01);
plt.xlabel('Month', fontsize=16);
plt.ylabel('Ride Counts', fontsize=16);
leg = ax.legend();
leg.set_title('User Type',prop={'size':16});
<div id="intro", style="font-size:18px; font-family:cursive">
- Subscribers ride least in December, while customers ride least in February. </div>
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
From the above two plots we can see that -
- There is no fixed trend in rides across user types.
- In December we see fewer rides by the users.
- For customers there is generally an upward trend in rides from May to December. We assume that the popularity of Bay Wheels is growing among customers. </div>
<div id="intro", style="font-size:18px; font-family:cursive">
# See weekly rides trend by user_type.
fig=plt.figure(figsize=(10,5))
sb.countplot(data=df, x='start_weekday', hue='user_type',order=day_order, palette=['#8C4843','#4274AD']);
# Set the style and label the plot.
plt.xlabel('Day of Week', fontsize=16);
plt.ylabel('Counts of Rides', fontsize=16);
plt.title('Weekly Ride Statistics by User Type\n', y=1.05, fontsize=18, fontweight='bold');
plt.suptitle('\n\nYear 2019\n', fontsize=13);
<div id="intro", style="font-size:18px; font-family:cursive"> From the plot we can observe that -
- The number of rides by subscribers is always greater than the number of rides by customers.
- On weekends, rides by subscribers fall to around 50% of the working-day level.
- The number of rides by customers is more or less the same over the week. </div>
<div id="intro", style="font-size:18px; font-family:cursive">
# Plot a violin plot of trip duration by user_type.
sb.violinplot(data=df, x='user_type', y='duration_minute', color=base_color, inner='quartile');
plt.xlabel('User Type', fontsize=16);
plt.ylabel('Trip Duration in Minute', fontsize=16);
plt.title('Varying Trip Duration with User Type ', y=1.05, fontsize=18, fontweight='bold');
<div id="intro", style="font-size:18px; font-family:cursive">
- The above violin plot takes this shape because there are some large values in the duration_minute column. We cannot remove them as outliers, though, because they are possible values. A trip duration of 1200 minutes is 20 hours. It is possible that someone goes on a long trip, or hires a bike, goes somewhere and returns after some hours. So such large values cannot be removed as outliers. </div>
# See average ride duration by user_type. See which type users are more likely to ride more time.
plt.figure(figsize=(10,5))
sb.barplot(data=df, x='user_type', y='duration_minute', palette=['#8C4843','#4274AD']);
plt.xlabel('User Type', fontsize=16);
plt.ylabel('Avg. Ride Duration(m)', fontsize=16);
plt.title('Avg. Ride Duration VS User Type', fontsize=18);
<div id="intro", style="font-size:18px; font-family:cursive"> We observe that -
- The average trip duration of Customers is almost 2 times the average trip duration of Subscribers. </div>
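Because a few very long trips remain in the data, the mean can be pulled upward, so it may help to look at the median next to it. The sketch below compares both per user type on made-up durations; the numbers are assumptions chosen so that one long Customer trip is visible in the result.

```python
import pandas as pd

# Made-up trips; the 85-minute ride plays the role of a long trip.
trips = pd.DataFrame({
    'user_type': ['Subscriber', 'Subscriber', 'Subscriber',
                  'Customer', 'Customer', 'Customer'],
    'duration_minute': [8.0, 10.0, 12.0, 15.0, 20.0, 85.0],
})

# Mean and median duration per user type, side by side.
summary = trips.groupby('user_type')['duration_minute'].agg(['mean', 'median'])
print(summary)
```

In this toy example the Customer mean is twice the Customer median, which shows how a single long ride shifts the average.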
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
From the above bivariate exploration of the Bay Wheels system dataset we find the following interesting features of the rides operated by Bay Wheels in the Bay Area.
</br>
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
In this section we explore the relationships between three or more variables in the dataset, and extend the bivariate explorations above. </div>
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
</div>
# Make heatmap plot for visualization.
fig=plt.figure(figsize=(15,15));
plt.suptitle("Hourly Usage during Weekdays for Customers and Subscribers",fontsize=18, y = 1.04,fontweight='bold');
#subplot1
plt.subplot(2, 1, 1);
plt.subplots_adjust(hspace = 0.3);
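subscribers = df.query('user_type == "Subscriber"').groupby(['start_weekday', 'start_hour'])['bike_id'].size().reset_index(name='count')
# pivot takes keyword arguments in recent pandas; reindex puts the rows in Monday-to-Sunday order.
subscribers = subscribers.pivot(index='start_weekday', columns='start_hour', values='count').reindex(day_order)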
ax = sb.heatmap(subscribers, cmap='rocket_r');
plt.title('Subscribers', fontsize=18, loc='left');
plt.xlabel('Hour of Day', fontsize=16);
plt.ylabel('Day of Week', fontsize=16);
plt.yticks(rotation=50);
plt.xticks(hour_ticks,hour_ticks_labels);
#subplot2
plt.subplot(2, 1, 2);
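customers = df.query('user_type == "Customer"').groupby(['start_weekday', 'start_hour'])['bike_id'].size().reset_index(name='count')
# pivot takes keyword arguments in recent pandas; reindex puts the rows in Monday-to-Sunday order.
customers = customers.pivot(index='start_weekday', columns='start_hour', values='count').reindex(day_order)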
ax = sb.heatmap(customers, cmap='rocket_r');
plt.title('Customers', fontsize=18, loc='left');
plt.xlabel('Hour of Day', fontsize=16);
plt.ylabel('Day of Week', fontsize=16);
plt.yticks(rotation=50);
plt.xticks(hour_ticks,hour_ticks_labels);
<div id="intro", style="font-size:18px; font-family:cursive">
From the above heatmap we observe -
- For subscribers the peak hours are between 8 A.M. and 5 P.M. on working days, i.e. from Monday to Friday. On weekends the number of rides falls.
- For customers, the peak hours on working days are the same as for subscribers, but on weekends they are more likely to ride than subscribers. </div>
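The weekday-by-hour table behind such a heatmap can also be built with `pd.crosstab`. The following is a sketch on a handful of made-up rides, assuming the same column names as above; reindexing fills in the days and hours that have no rides, so the table always has 7 rows and 24 columns.

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# Made-up rides standing in for one user type's rows.
trips = pd.DataFrame({
    'start_weekday': ['Monday', 'Monday', 'Tuesday', 'Sunday', 'Monday'],
    'start_hour': [8, 8, 17, 11, 17],
})

# Count rides for every (weekday, hour) pair that occurs.
table = pd.crosstab(trips['start_weekday'], trips['start_hour'])

# Put the days in calendar order and add empty hours and days as zeros.
table = table.reindex(index=day_order, columns=range(24), fill_value=0)
print(table.shape)
```

A table of this shape can be passed to a heatmap function in the same way as the pivoted frames above.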
<div id="intro", style="font-size:18px; font-family:cursive">
</div>
# Make a plot to visualize the relationships between the three variables.
fig=plt.figure(figsize=(7,7))
sb.pointplot(data=df, x='start_weekday', y='duration_minute', hue='user_type', dodge=0.3, linestyles=":",order=day_order);
plt.title("Average Trip Duration Over The Week for Customers and Subscribers", fontsize=18,
y = 1.04, fontweight='bold', color = 'black')
plt.xlabel('Day of Week', fontsize=16);
plt.ylabel('Avg. Trip Duration in Minute', fontsize=16);
<div id="intro", style="font-size:18px; font-family:cursive">
- Though subscribers ride less on weekends, those who do ride on weekends ride for a comparatively long time. The average ride time for customers is always greater than that of subscribers.
</div>
</br>
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
From the above multivariate exploration we can see some interesting features of the dataset. The features are the following -
</br>
</br>
<div id="intro", style="font-size:18px; font-family:cursive">
Here we explored the Bay Wheels system dataset. This organisation operates a bike-sharing service in the California Bay Area. We analysed the Bay Wheels 2019 trip dataset, which contains details of around 2.5M rides, and found some interesting facts. Demand for the service is higher on working days than on weekends: around 80% of rides take place on working days and 20% on weekends. The service is becoming popular among customer users. Spring is the most comfortable season for riders. Tuesday sees the highest demand of the week. The two peak hours are 8 A.M. and 5 P.M.; we assume that this is due to working or office hours.
</br>